























## Replacement Policy

- Direct mapped: no choice.
- Set associative:
  - Prefer a non-valid entry, if there is one.
  - Otherwise, choose among the entries in the set.
- Least-recently used (LRU) is common:
  - Choose the entry unused for the longest time.
    - Simple for 2-way, manageable for 4-way, too complicated beyond that.
- Random:
  - Oddly, gives about the same performance as LRU for high associativity.
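
The two policies above can be sketched for a single cache set. This is a minimal illustration, not how hardware implements it: the class name `CacheSet`, its parameters, and the use of an ordered dictionary to track recency are all assumptions made for the example.

```python
import random
from collections import OrderedDict


class CacheSet:
    """One set of a set-associative cache (hypothetical model, tags only)."""

    def __init__(self, ways, policy="lru", rng=None):
        self.ways = ways
        self.policy = policy                  # "lru" or "random"
        self.rng = rng or random.Random(0)    # seeded for repeatable runs
        self.blocks = OrderedDict()           # tag -> None, least recent first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        if tag in self.blocks:
            if self.policy == "lru":
                self.blocks.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.blocks) < self.ways:
            # A non-valid (empty) entry exists: fill it, no eviction needed.
            self.blocks[tag] = None
            return False
        if self.policy == "lru":
            self.blocks.popitem(last=False)   # evict the least recently used
        else:
            victim = self.rng.choice(list(self.blocks))
            del self.blocks[victim]           # evict a randomly chosen entry
        self.blocks[tag] = None
        return False


lru = CacheSet(ways=2, policy="lru")
print([lru.access(t) for t in ["A", "B", "A", "C", "B"]])
# prints [False, False, True, False, False]
```

In the trace, the access to `C` evicts `B` (the least recently used entry), so the final access to `B` misses. Exact LRU needs the full recency order per set, which is one reason the bookkeeping grows quickly with associativity.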









## Two Machines' Cache Parameters

| Characteristic                | ARM Cortex-A8                        | Intel Nehalem                                 |
|-------------------------------|--------------------------------------|-----------------------------------------------|
| L1 cache organization         | Split instruction and data caches    | Split instruction and data caches             |
| L1 cache size                 | 32 KiB each for instructions/data    | 32 KiB each for instructions/data<br>per core |
| L1 cache associativity        | 4-way (I), 4-way (D) set associative | 4-way (I), 8-way (D) set associative          |
| L1 replacement                | Random                               | Approximated LRU                              |
| L1 block size                 | 64 bytes                             | 64 bytes                                      |
| L1 write policy               | Write-back, Write-allocate(?)        | Write-back, No-write-allocate                 |
| L1 hit time (load-use)        | 1 clock cycle                        | 4 clock cycles, pipelined                     |
| L2 cache organization         | Unified (instruction and data)       | Unified (instruction and data) per core       |
| L2 cache size                 | 128 KiB to 1 MiB                     | 256 KiB (0.25 MiB)                            |
| L2 cache associativity        | 8-way set associative                | 8-way set associative                         |
| L2 replacement                | Random(?)                            | Approximated LRU                              |
| L2 block size                 | 64 bytes                             | 64 bytes                                      |
| L2 write policy               | Write-back, Write-allocate (?)       | Write-back, Write-allocate                    |
| L2 hit time                   | 11 clock cycles                      | 10 clock cycles                               |
| L3 cache organization         | -                                    | Unified (instruction and data)                |
| L3 cache size                 | -                                    | 8 MiB, shared                                 |
| L3 cache associativity        | -                                    | 16-way set associative                        |
| L3 replacement                | -                                    | Approximated LRU                              |
| L3 block size                 | -                                    | 64 bytes                                      |
| L3 write policy               | -                                    | Write-back, Write-allocate                    |
| L3 hit time                   | -                                    | 35 clock cycles                               |
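
As a worked example, the size, associativity, and block size in the table determine how many sets each cache has and how many address bits select the byte offset and the set index. The helper `cache_geometry` below is a hypothetical name for this sketch, and it assumes all three quantities are powers of two.

```python
def cache_geometry(size_bytes, ways, block_bytes):
    """Return (number of sets, offset bits, index bits) for a cache.

    Assumes size, associativity, and block size are all powers of two.
    """
    num_blocks = size_bytes // block_bytes
    num_sets = num_blocks // ways
    offset_bits = block_bytes.bit_length() - 1  # log2 of the block size
    index_bits = num_sets.bit_length() - 1      # log2 of the number of sets
    return num_sets, offset_bits, index_bits


KIB = 1024
MIB = 1024 * KIB

print(cache_geometry(32 * KIB, 4, 64))   # Cortex-A8 L1: (128, 6, 7)
print(cache_geometry(32 * KIB, 8, 64))   # Nehalem L1 data: (64, 6, 6)
print(cache_geometry(8 * MIB, 16, 64))   # Nehalem L3: (8192, 6, 13)
```

With the same 32 KiB capacity and 64-byte blocks, doubling the associativity from 4-way to 8-way halves the number of sets, so one fewer address bit is used for the index.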





## Summary: Improving Cache Performance

- 1. Reduce the time to hit in the cache:
  - Smaller cache.
  - Direct mapped cache.
  - Smaller blocks.
  - For writes:
    - No write allocate: no "hit" on cache, just write to the write buffer.
    - Write allocate: to avoid two cycles (first check for hit, then write), pipeline writes via a delayed write buffer to the cache.
- 2. Reduce the miss rate:
  - Bigger cache.
  - More flexible placement (increase associativity).
  - Larger blocks (16 to 64 bytes typical).
  - Victim cache: a small buffer holding the most recently replaced blocks.
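
The victim cache idea can be sketched as a direct-mapped cache that checks a small buffer of recently evicted blocks before counting a miss. This is an illustrative model only: the class name `DirectMappedWithVictim`, the block-address interface, and the oldest-first replacement of the buffer are assumptions for the example.

```python
from collections import deque


class DirectMappedWithVictim:
    """Direct-mapped cache with a small victim buffer (hypothetical model)."""

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.lines = [None] * num_lines              # block address, None = invalid
        self.victims = deque(maxlen=victim_entries)  # most recently evicted blocks

    def access(self, block_addr):
        """Return 'hit', 'victim_hit', or 'miss' for one block address."""
        index = block_addr % self.num_lines
        resident = self.lines[index]
        if resident == block_addr:
            return "hit"
        outcome = "miss"
        if block_addr in self.victims:
            # Recently evicted block: recover it without going to the next level.
            self.victims.remove(block_addr)
            outcome = "victim_hit"
        if resident is not None:
            self.victims.append(resident)  # oldest victim is dropped when full
        self.lines[index] = block_addr
        return outcome


# Blocks 0 and 4 map to the same line of a 4-line cache and keep evicting each other.
plain = DirectMappedWithVictim(num_lines=4, victim_entries=0)
print([plain.access(b) for b in [0, 4, 0, 4]])
# prints ['miss', 'miss', 'miss', 'miss']

with_victims = DirectMappedWithVictim(num_lines=4, victim_entries=2)
print([with_victims.access(b) for b in [0, 4, 0, 4]])
# prints ['miss', 'miss', 'victim_hit', 'victim_hit']
```

The conflict misses of the plain direct-mapped cache become victim-buffer hits, which is the effect the bullet describes: the main cache keeps its simple placement while the small buffer absorbs repeated conflicts.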





